Manifestation of depression in speech overlaps with characteristics used to represent and recognize speaker identity

The sound of a person’s voice is commonly used to identify the speaker. The sound of speech is also starting to be used to detect medical conditions, such as depression. It is not known whether the manifestations of depression in speech overlap with those used to identify the speaker. In this paper, we test the hypothesis that the representations of personal identity in speech, known as speaker embeddings, improve the detection of depression and the estimation of depressive symptom severity. We further examine whether changes in depression severity interfere with the recognition of a speaker’s identity. We extract speaker embeddings from models pre-trained on a large sample of speakers from the general population without information on depression diagnosis. We test these speaker embeddings for severity estimation in independent datasets consisting of clinical interviews (DAIC-WOZ), spontaneous speech (VocalMind), and longitudinal data (VocalMind). We also use the severity estimates to predict the presence of depression. Speaker embeddings, combined with established acoustic features (OpenSMILE), predicted severity with root mean square error (RMSE) values of 6.01 and 6.28 in the DAIC-WOZ and VocalMind datasets, respectively, lower than either acoustic features alone or speaker embeddings alone. When used to detect depression, speaker embeddings showed higher balanced accuracy (BAc) than acoustic features and surpassed previous state-of-the-art performance in depression detection from speech, with BAc values of 66% and 64% in the DAIC-WOZ and VocalMind datasets, respectively. Results from a subset of participants with repeated speech samples show that speaker identification is affected by changes in depression severity. These results suggest that depression overlaps with personal identity in the acoustic space. While speaker embeddings improve depression detection and severity estimation, deterioration or improvement in mood may interfere with speaker verification.

256 units). The linear layer performs an affine transformation on the last frame response of the LSTM layers. The output of the linear layer of the network is denoted as $f(x_{ji}; \theta)$, where $\theta$ represents the parameters of the entire neural network. The embedding vector (also known as d-vector) is defined as the L2 normalization of the final layer output:

$$e_{ji} = \frac{f(x_{ji}; \theta)}{\| f(x_{ji}; \theta) \|_2},$$

where $e_{ji}$ represents the embedding vector obtained for the $i$-th utterance of the $j$-th speaker. For each speaker in the batch, a centroid $c_j$ (which represents the voice print of the $j$-th speaker) is computed as the mean of all the embedding vectors $[e_{j1}, e_{j2}, \cdots, e_{jM}]$ corresponding to the $j$-th speaker:

$$c_j = \frac{1}{M} \sum_{m=1}^{M} e_{jm}.$$

Then a similarity matrix $S$ is computed for each batch, with $N \times M$ rows and $N$ columns. An element $S_{ji,k}$ of the similarity matrix is defined as the scaled cosine similarity between each embedding vector $e_{ji}$ and each centroid $c_k$ ($1 \le j, k \le N$ and $1 \le i \le M$):

$$S_{ji,k} = w \cdot \cos(e_{ji}, c_k) + b,$$

where $w$ and $b$ are learnable parameters. The model was trained using a softmax over $S_{ji,k}$ for $k = 1, \dots, N$, which should output 1 when $k = j$ and 0 otherwise. The softmax loss on the embedding vector $e_{ji}$ is therefore:

$$L(e_{ji}) = -S_{ji,j} + \log \sum_{k=1}^{N} \exp(S_{ji,k}).$$
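As a concrete illustration, below is a minimal PyTorch sketch of this softmax variant of the loss, assuming a batch of $M$ utterances from each of $N$ speakers. The function and variable names are illustrative, not taken from the paper's code, and, following the text above, each centroid is the mean over all $M$ embeddings of a speaker.

```python
# Minimal sketch of the softmax speaker-embedding loss described above.
# Assumes a batch of embeddings shaped (N speakers, M utterances, D dims).
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(emb: torch.Tensor, w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    N, M, D = emb.shape
    emb = F.normalize(emb, p=2, dim=-1)      # L2-normalize: e_ji
    centroids = emb.mean(dim=1)              # c_j, shape (N, D)
    # Cosine similarity between every e_ji and every centroid c_k.
    sim = F.cosine_similarity(
        emb.reshape(N * M, 1, D), centroids.reshape(1, N, D), dim=-1
    )                                        # shape (N*M, N)
    S = w * sim + b                          # scaled similarity matrix
    # The target class for e_ji is its own speaker index j.
    targets = torch.arange(N).repeat_interleave(M)
    # Cross-entropy computes exactly L(e_ji) = -S_{ji,j} + log sum_k exp(S_{ji,k}).
    return F.cross_entropy(S, targets)

# Example usage with synthetic data: N=4 speakers, M=5 utterances, D=256.
emb = torch.randn(4, 5, 256, requires_grad=True)
w = torch.tensor(10.0, requires_grad=True)   # learnable scale (kept positive in practice)
b = torch.tensor(-5.0, requires_grad=True)   # learnable bias
loss = ge2e_softmax_loss(emb, w, b)
```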
This loss function enables the network to learn parameters such that embedding vectors corresponding to the $j$-th speaker are pulled close to the centroid $c_j$ and, at the same time, pushed away from the centroids of other speakers. We used the pre-trained models for extracting speaker embeddings (x-vectors, ECAPA-TDNN x-vectors, and d-vectors) at segment level for each of the DAIC-WOZ and FORBOW datasets. Each segment is represented using a speaker embedding of dimension 512, 192, and 256 for the x-vector, ECAPA-TDNN x-vector, and d-vector, respectively. Finally, we use these speaker embeddings separately to train and test the LSTM- and CNN-based models for depression detection, i.e., the LSTM and CNN models were trained separately on x-vector, ECAPA-TDNN x-vector, and d-vector speaker embeddings.
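For the extraction step, a hedged sketch using a publicly available pre-trained ECAPA-TDNN checkpoint is shown below; the SpeechBrain VoxCeleb model, save directory, and file path are assumptions for illustration, not necessarily the exact setup used in this work.

```python
# Sketch of segment-level speaker-embedding extraction with a pre-trained
# ECAPA-TDNN model (SpeechBrain VoxCeleb checkpoint assumed for illustration).
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

signal, sr = torchaudio.load("segment_001.wav")  # one speech segment
emb = encoder.encode_batch(signal)               # shape (1, 1, 192) for this checkpoint
embedding = emb.squeeze()                        # 192-dim ECAPA-TDNN x-vector
```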

Additional Results
Depression detection using speaker embeddings: Supplementary Table S1 shows the depression assessment results obtained using the different speaker embeddings, i.e., x-vectors, ECAPA-TDNN x-vectors, and d-vectors. It can be observed from Supplementary Table S1 that all three types of speaker embeddings achieve SOTA performance on depression assessment, with ECAPA-TDNN x-vectors and d-vectors achieving better performance than x-vectors.
Supplementary Table S2 shows the depression assessment results obtained by combining the speaker embeddings (x-vectors, ECAPA-TDNN x-vectors, and d-vectors) with acoustic features (COVAREP and OpenSMILE). It can be observed that the depression assessment performance improved when the speaker embeddings were combined with the acoustic features. The best performance was achieved when ECAPA-TDNN x-vectors were combined with OpenSMILE features.
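A minimal sketch of this early-fusion step, assuming per-segment ECAPA-TDNN x-vectors and eGeMAPS-style OpenSMILE functionals; the array names and dimensions are illustrative, not the paper's exact configuration.

```python
# Sketch of feature fusion: concatenate each segment's speaker embedding
# with its OpenSMILE functionals before feeding the sequence to a classifier.
import numpy as np

T = 12                                    # segments in one recording (illustrative)
spk_emb = np.random.randn(T, 192)         # ECAPA-TDNN x-vectors per segment
opensmile_feats = np.random.randn(T, 88)  # eGeMAPS functionals per segment

# Early fusion: the resulting (T, 280) sequence is the input to the
# LSTM/CNN depression classifier.
fused = np.concatenate([spk_emb, opensmile_feats], axis=1)
```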
Depression detection using demographic information: In order to understand the significance of demographic variables such as biological sex and age in detecting depression, we trained different machine learning models (decision trees, support vector machines (SVMs), and deep neural networks (DNNs)) for the task of depression detection using: (1) only biological sex, (2) only age, and (3) both biological sex and age. Supplementary Table S3 shows the performance of the different models trained using demographic variables for the task of depression detection. When only biological sex or only age was used to train the models, the models were biased towards the majority class, i.e., the models always predicted healthy (and never depressed) irrespective of the input value (sensitivity = 0.0 and specificity = 1.0). This shows that sex or age alone might not provide sufficient information to detect depression. When both sex and age were used to train the models, the models were still not able to perform depression detection nearly as well as the models using speaker embeddings (ECAPA-TDNN x-vectors). This indicates that speaker embeddings capture more information than just biological sex and age. This may also be the reason for the improved emotion classification performance using x-vector speaker embeddings 10 . In addition, there were no significant differences in age between the depressed and healthy participants, which explains the low performance of the machine learning models trained using age as input.

Table S4. Gender-based depression detection performance using LSTM models with ECAPA-TDNN x-vectors as input. Female and Male refer to the models trained and tested using the female and male subsets, respectively. Female (under-sampled) is obtained by under-sampling the female subset to match the distribution of the male subset of the Vocal Mind dataset. Imbalance ratio (IR) refers to the ratio of non-depressed to depressed samples. BAc refers to balanced accuracy.

Gender-specific depression detection: Previous works have pointed out that the non-uniform gender distribution of depressed and healthy participants in the DAIC-WOZ dataset led to overestimated performance of machine learning models 11 , because the models simply learn gender-specific information from the voice. In order to analyze the contribution of the gender-agnostic information contained in speaker embeddings for depression detection, we performed gender-specific depression detection as done in previous works 12,13 . In these experiments, we divided the entire dataset into two gender-based subsets: one with only female speakers and the other with only male speakers. We then performed 5-fold cross-validation on each subset separately. Supplementary Table S4 provides the gender-specific performance of the LSTM model with ECAPA-TDNN speaker embeddings as input.
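For illustration, a hedged sketch of this gender-specific evaluation protocol is given below; a logistic-regression placeholder stands in for the LSTM, and the data, labels, and feature dimensions are synthetic.

```python
# Sketch of gender-specific 5-fold cross-validation: the data is split by
# sex and each subset is cross-validated separately with balanced accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((514, 192))   # ECAPA-TDNN x-vectors (synthetic)
y = rng.integers(0, 2, 514)           # depression labels (synthetic)
sex = rng.integers(0, 2, 514)         # 0 = female, 1 = male (synthetic)

for name, mask in (("Female", sex == 0), ("Male", sex == 1)):
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             X[mask], y[mask], cv=5,
                             scoring="balanced_accuracy")
    print(name, round(scores.mean(), 3))
```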

For the DAIC-WOZ dataset (see Supplementary Table S4 (a)), the Female and Male models have similar performance, with the Female model performing slightly better than the Male model. This shows that depression detection using speaker embeddings was not simply relying on gender-based information. The difference in performance between the two models may be due to the difference in the imbalance ratio of non-depressed to depressed samples within each gender: for females, the imbalance ratio is 59:31 ≈ 2:1, whereas for males it is 95:34 ≈ 3:1.
For the Vocal Mind dataset (see Supplementary Table S4 (b)), there is a large difference between the performance of the Female and Male models, with the Female model performing better than the Male model. This might be accounted for by the large difference in the imbalance ratio between female participants (294:95 ≈ 3:1) and male participants (109:16 ≈ 7:1). In order to understand the effect of the imbalance ratio on gender-based model performance, we under-sampled the female samples to match the distribution of the male subset, i.e., we randomly selected 125 female samples (109 non-depressed and 16 depressed). We then performed 5-fold cross-validation on this under-sampled set (Female (under-sampled)). It can be observed that the performance of the model on the Female (under-sampled) subset is similar to the performance of the Male model, demonstrating the effect of class imbalance on model performance.
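A minimal sketch of this under-sampling step, using the counts reported above (109 non-depressed, 16 depressed); the index handling is illustrative rather than the exact pipeline used here.

```python
# Sketch of under-sampling the female subset to match the male subset's
# size and class balance before cross-validation.
import numpy as np

rng = np.random.default_rng(0)

def undersample(indices: np.ndarray, labels: np.ndarray,
                n_neg: int = 109, n_pos: int = 16) -> np.ndarray:
    """Randomly draw n_neg non-depressed and n_pos depressed samples."""
    neg = rng.choice(indices[labels == 0], n_neg, replace=False)
    pos = rng.choice(indices[labels == 1], n_pos, replace=False)
    return np.concatenate([neg, pos])

# In practice these would come from the dataset's metadata.
female_idx = np.arange(389)                               # 294 + 95 female samples
female_labels = np.concatenate([np.zeros(294), np.ones(95)])
subset = undersample(female_idx, female_labels)           # 125 samples, ~7:1 ratio
```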
Supplementary Table S5 shows the confusion matrices obtained using the (a-c) DAIC-WOZ and (d-f) Vocal Mind datasets. It can be observed that, for both datasets, models trained by combining ECAPA-TDNN embeddings with OpenSMILE features identify people with depression better than models trained using only ECAPA-TDNN embeddings. Further, the no-information system predicts every person to be healthy (the majority class) irrespective of the input, i.e., it is unable to detect people with depression.

Supplementary Figure S1 shows the depression detection performance on the DAIC-WOZ (DAIC) and Vocal Mind (VM) datasets for different temporal contexts. We use two input configurations: the first uses only speaker embeddings, while the second uses a combination of speaker embeddings and OpenSMILE features (Spk-Emb, OS). Our temporal contexts range from 20 seconds (4 contiguous sub-segments) to 80 or 100 seconds (16 or 20 contiguous sub-segments). As the context increases, the depression detection performance improves until it saturates. For instance, Supplementary Figure S1 shows that for the CE_l model trained using combined speaker embeddings and OpenSMILE features on the Vocal Mind dataset (i.e., VM: Spk-Emb, OS), increasing the temporal context up to 16 segments improves the accuracy to 0.76, indicating that the temporal relationships across the segments of a speech recording provide essential cues for depression detection.
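A hedged sketch of how such temporal contexts can be constructed from per-sub-segment features; the 5-second sub-segment length is an assumption inferred from the reported "20 seconds = 4 contiguous sub-segments", and the feature dimensions are illustrative.

```python
# Sketch of building temporal contexts: group contiguous sub-segments into
# non-overlapping windows of 4 to 20 segments (20 s to 100 s, assuming
# 5-second sub-segments).
import numpy as np

def make_contexts(features: np.ndarray, context_len: int) -> np.ndarray:
    """Split a (T, D) sequence of sub-segment features into
    non-overlapping (context_len, D) windows."""
    n = features.shape[0] // context_len
    return features[: n * context_len].reshape(n, context_len, -1)

feats = np.random.randn(60, 280)      # 60 sub-segments of fused features
for c in (4, 8, 16, 20):              # 20 s ... 100 s of context
    windows = make_contexts(feats, c)
    print(c, windows.shape)           # (n_windows, c, 280)
```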

Related Work
Acoustic Representations for Depression Analysis: Several modalities, such as text, speech, electronic health records, and wearable and mobile sensors, have been used for depression classification and severity estimation [14][15][16][17] . Speech is one modality that has attracted considerable research attention in recent times [18][19][20] . Depression has been shown to degrade cognitive planning and psycho-motor functioning, thus affecting the human speech production mechanism 19 . These effects manifest as variations in voice quality 21 , and several features have been proposed to capture these variations for depression analysis. Spectral features such as formants and mel-frequency cepstral coefficients (MFCCs), prosodic features such as F0, jitter, and shimmer, and glottal features were initially used for depression detection [22][23][24] . Spectral, prosodic, and other voice-quality-related features extracted using the OpenSMILE 25 and COVAREP 26 toolkits were also used for depression analysis 27,28 (a sketch of OpenSMILE feature extraction follows this paragraph). Further, features based on speech articulation, such as vocal tract coordination features, were analyzed for depression detection 21,29,30 . Recently, sentiment and emotion embeddings, representing non-verbal characteristics of speech, were used for depression severity estimation 31 . To the best of our knowledge, no previous studies have explored the use of speaker-specific information for depression detection. In this work, we consider using speaker embeddings for depression analysis.
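As a concrete illustration, acoustic functionals of the kind cited above can be extracted with the opensmile Python package; the eGeMAPS configuration and file path below are illustrative choices, not necessarily the configuration used in the cited works.

```python
# Sketch of OpenSMILE feature extraction: one vector of eGeMAPS functionals
# (e.g., F0, jitter, shimmer statistics) per audio file or segment.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("segment_001.wav")  # DataFrame, one row of functionals
```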
Speaker Embeddings: Speaker embeddings are low-dimensional representations of the speaker-specific characteristics that exist in the speech signal 2,32 and can be designed to be relatively independent of what the speaker is saying. Speaker representations were initially based on i-vectors, with a probabilistic linear discriminant analysis (PLDA) back-end 33 . Recently, two distinct end-to-end deep-neural-network-based approaches were used for speaker verification, and both obtained comparable SOTA performance 2,9 . In Snyder et al., speaker embeddings, referred to as x-vectors, were extracted from a time-delay deep neural network (TDNN) trained for the task of speaker verification 2 . In contrast, speaker embeddings referred to as d-vectors were extracted from an end-to-end LSTM network trained for speaker verification 9 . Subsequent improvements to the TDNN architecture of x-vectors have further improved speaker verification performance 4,34 . In Emphasized Channel Attention, Propagation and Aggregation TDNN (ECAPA-TDNN 4 ), the TDNN architecture was enhanced for the task of speaker classification by introducing improvements related to channel attention, propagation, and aggregation. In this work, we analyze three different variants of speaker embeddings, i.e., x-vectors 2 , d-vectors 9 , and ECAPA-TDNN x-vectors 4 .
Deep learning for depression diagnosis: Recently, the application of deep learning techniques has significantly boosted the performance of depression detection using speech 28,29,[35][36][37] . Initially, deep neural networks (DNNs) with fully-connected layers were trained for depression detection 35 . Later, convolutional neural networks (CNNs) and recurrent neural networks with long short-term memory (LSTM) units were shown to achieve better performance on depression detection 28,37 . Recently, CNN-LSTM and dilated CNN models were used for depression detection from speech to achieve SOTA performance 29,36 . In this work, we use speaker embeddings to train multi-kernel CNNs 38 and LSTM models for depression analysis.
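For reference, below is a minimal PyTorch sketch of a multi-kernel CNN in the spirit of the architecture referenced above: parallel 1-D convolutions with different kernel sizes over the segment axis, max-pooled and concatenated. All layer sizes and kernel choices are illustrative, not the paper's exact hyperparameters.

```python
# Sketch of a multi-kernel CNN classifier over a sequence of per-segment
# speaker embeddings.
import torch
import torch.nn as nn

class MultiKernelCNN(nn.Module):
    def __init__(self, in_dim: int = 192, n_filters: int = 64,
                 kernel_sizes=(3, 5, 7), n_classes: int = 2):
        super().__init__()
        # One 1-D convolution branch per kernel size.
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes
        ])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, x):                 # x: (batch, T, in_dim)
        x = x.transpose(1, 2)             # -> (batch, in_dim, T)
        # Max-pool each branch over time, then concatenate the branches.
        pooled = [torch.amax(torch.relu(c(x)), dim=-1) for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = MultiKernelCNN()
logits = model(torch.randn(8, 12, 192))   # 8 recordings, 12 segments each
```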
Temporal Context in Depression Detection: A few studies have analyzed the significance of the total duration of the audio recording for depression detection performance [39][40][41] . These works have shown that the longer the duration, the better the performance. In Yang et al. and Pampouchidou et al., the analysis was performed by considering multiple modalities, i.e., audio, visual, and text 39,40 , whereas in Rutowski et al., automatic speech-to-text transcriptions were used to analyze the effect of duration on depression detection performance 41 . In this work, we use the acoustic features extracted from speech to analyze the effect of varying the number of contiguous speech segments on the performance of LSTM and CNN models trained for depression detection.